[Test] Add some gsm8k configs for hybrid models. #35406
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Code Review
This pull request adds several test configurations for gsm8k evaluation of hybrid models, specifically for Qwen3Next. My review found a critical issue in one of the new configuration files. The configuration for Qwen3-Next-FP8-TP4-MTP-Align enables prefix caching, which is not supported for hybrid models like Qwen3Next and will cause the engine to fail. This should be removed.
--max-model-len 4096
--tensor-parallel-size 4
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
--enable-prefix-caching
@@ -0,0 +1,9 @@
model_name: "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8"
Can we just have this one? Or is it also useful to test the non-spec-decoding / non-prefix-caching case?
It's useful, yeah. I put the 3 configs in a "hybrid" folder so as not to pollute what's there.
Alternatively, if we want to keep the number of configs to a minimum, maybe it could be useful to be able to pass additional overrides when passing the configs to pytest (if that isn't possible already).
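To make the override idea concrete, here is a minimal sketch in plain Python of how extra server flags could be merged into a config's argument string. Everything in it is an assumption for illustration: `parse_flags` and `apply_overrides` are hypothetical helpers, not part of the existing gsm8k test harness or of pytest, and it assumes each flag takes at most one value.

```python
import shlex


def parse_flags(arg_string):
    """Split a CLI string into an ordered {flag: value-or-None} mapping.

    Hypothetical helper: assumes every flag starts with "--" and takes
    at most one value.
    """
    flags = {}
    current = None
    for token in shlex.split(arg_string):
        if token.startswith("--"):
            current = token
            flags[current] = None
        elif current is not None:
            flags[current] = token
            current = None
        else:
            raise ValueError(f"unexpected value without a flag: {token!r}")
    return flags


def apply_overrides(server_args, overrides="", drop=()):
    """Return server_args with overrides applied and dropped flags removed.

    Existing flags keep their position; new flags are appended.
    """
    merged = parse_flags(server_args)
    merged.update(parse_flags(overrides))
    for flag in drop:
        merged.pop(flag, None)
    parts = []
    for flag, value in merged.items():
        parts.append(flag)
        if value is not None:
            parts.append(value)
    # shlex.join re-quotes values such as JSON strings for the shell.
    return shlex.join(parts)


base = (
    "--max-model-len 4096 --tensor-parallel-size 4 "
    "--speculative-config "
    "'{\"method\":\"qwen3_next_mtp\",\"num_speculative_tokens\":2}' "
    "--enable-prefix-caching"
)

# Derive the non-prefix-caching variant from the single base config.
no_prefix_caching = apply_overrides(base, drop=["--enable-prefix-caching"])
```

With something like this, one base config could cover the variants discussed above (with and without prefix caching or speculative decoding) instead of keeping a separate file per combination.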
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Purpose
This PR adds some configs to the gsm8k testing framework that are very helpful for development on the hybrid models. I found this super helpful for debugging something I'm working on right now related to MTP + prefix caching + async scheduling.
Test Plan
They can be run with:
Test Result